A visual positioning model for UAV’s patrolling video sequence images based on DOM rectification

With the development of multi-sensor technology, a UAV (unmanned aerial vehicle) can identify and locate key targets in essential monitoring areas or geological disaster-prone areas by taking video sequence images, and precise positioning of these images remains a matter of great concern. In recent years, precise positioning of aerial images has been widely studied, but it is still a challenge to realize precise, robust and dynamic positioning of a UAV's patrolling video sequence images simultaneously and in real time. To solve this problem, a visual positioning model for patrolling video sequence images based on DOM rectification is proposed, comprising a robust block-matching algorithm and a precise polynomial-rectifying algorithm. First, the robust block-matching algorithm is used to obtain the best matching area for the UAV's video sequence image on the DOM (Digital Orthophoto Map), a pre-acquired digital orthophoto map covering the UAV's whole patrolling region. Second, the precise polynomial-rectifying algorithm is used to calculate accurate rectification parameters for mapping the UAV's video sequence image to the best matching area obtained above, so that real-time positioning of the UAV's patrolling video sequence images can be realized. Finally, the two algorithms are analyzed and verified by three practical experiments. The results indicate that, even if spatial resolution, surface specific features, illumination conditions and topographic relief differ significantly between the DOM and the UAV's patrolling video sequence images, the proposed algorithms can still steadily position each video sequence image with an accuracy of about 2.5 m within 1 s. To some extent, this study improves the real-time precise positioning of UAV patrolling video sequence images, and the proposed mathematical model can be incorporated directly into a UAV's patrolling system without any hardware overhead.

First, extract datum-DOM (number 4) from region-DOM (number 3) according to the POS data of video frame (number 2), and replace region-DOM (number 3) by datum-DOM (number 4) as a new matching area for video frame (number 2), so as to reduce matching area of video frame (number 2) on region-DOM (number 3) and increase matching speed.
Second, extract block-matched-DOM (number 5) from datum-DOM (number 4) by using the proposed robust block-matching algorithm. It should be noted that video frame (number 2) and block-matched-DOM (number 5) have the same size in pixels, but the matching accuracy between these two images is still poor due to numerous negative factors. Therefore, a further optimization step is needed.
Third, figure out accurate rectification parameters for mapping video frame (number 2) to block-matched-DOM (number 5) by using the proposed precise polynomial-rectifying algorithm.
Finally, obtain geodetic coordinates of each pixel in video frame (number 2) by using the accurate rectification parameters calculated above, so as to realize the real time positioning of video frame (number 2).

Algorithm framework
The algorithm flow of this study is shown in Fig. 2. Advantages lie in the proposed robust image-block-matching algorithm and precise polynomial-rectifying algorithm, which can solve geodetic coordinates of all pixels in a UAV's real-time video frame with about 2.5 m level accuracy in 1 s.

The visual positioning model

Extraction of datum-DOM
Following the basic idea of this paper, datum-DOM should first be extracted from region-DOM, so as to reduce the matching area of the video frame on region-DOM and increase matching speed. As shown in Fig. 1, the coordinates of the central point of datum-DOM are determined by the geodetic coordinates in the UAV's POS data, the azimuth of datum-DOM is determined by the yaw angle in the UAV's POS data, and the length and width of datum-DOM in pixels are determined by:

$$L_{pixels} = n \times L_{dist} / gsd_{D}, \qquad W_{pixels} = n \times W_{dist} / gsd_{D} \tag{1}$$

where $L_{pixels}$ and $W_{pixels}$ are the length and width of datum-DOM in pixels respectively; $L_{dist} = H_{fly} \times L_{CMOS} / f$; $W_{dist} = H_{fly} \times W_{CMOS} / f$; $H_{fly}$ is the flight altitude of the UAV; $L_{CMOS}$ and $W_{CMOS}$ are the physical length and width of the UAV's CMOS (Complementary Metal Oxide Semiconductor) sensor respectively; $f$ is the focal length of the UAV's camera; $gsd_{D}$ is the spatial resolution of datum-DOM; $n$ is a scaling coefficient ranging from 1.5 to 2.
Finally, datum-DOM can be extracted from region-DOM according to the known parameters $L_{POS}$, $B_{POS}$, $Yaw_{pos}$, $L_{pixels}$ and $W_{pixels}$, where $(L_{POS}, B_{POS})$ are the coordinates of the central point of datum-DOM; $Yaw_{pos}$ is the yaw angle in the UAV's POS data; $L_{pixels}$ and $W_{pixels}$ are obtained from Eq. (1).
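The size calculation can be illustrated with a minimal Python sketch. It assumes that Eq. (1) has the form $L_{pixels} = n \times L_{dist} / gsd_{D}$ (and likewise for the width), which follows from the symbol definitions above. The function name `datum_dom_size` and the camera values in the example are hypothetical and are not taken from the experiments.

```python
def datum_dom_size(h_fly, l_cmos, w_cmos, f, gsd_d, n=1.5):
    """Return (L_pixels, W_pixels) of datum-DOM, rounded to whole pixels.

    h_fly and gsd_d are in metres; l_cmos, w_cmos and f share one unit (e.g. mm).
    """
    if not 1.5 <= n <= 2.0:
        raise ValueError("scaling coefficient n is expected in [1.5, 2]")
    l_dist = h_fly * l_cmos / f  # ground length covered by the frame, in metres
    w_dist = h_fly * w_cmos / f  # ground width covered by the frame, in metres
    return round(n * l_dist / gsd_d), round(n * w_dist / gsd_d)


# Hypothetical camera: 12 mm x 8 mm sensor, 10 mm focal length, 250 m altitude.
size = datum_dom_size(h_fly=250.0, l_cmos=12.0, w_cmos=8.0, f=10.0, gsd_d=0.5, n=2.0)
print(size)
```

A larger n enlarges the search window and so tolerates larger POS errors, at the cost of a longer matching time.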

Construction of robust block-matching algorithm
Following the basic idea of this paper, the best matching area for the video frame on datum-DOM should be extracted. However, existing image feature matching methods all have difficulty matching the video frame and datum-DOM automatically, since the illumination conditions, surface specific features, projection modes and spatial resolutions of these two kinds of images differ greatly. Therefore, a robust block-matching algorithm is constructed to find the best matching area for the video frame on datum-DOM.

Block-matching roughly based on RGB color
At this step, the best matching area for the video frame on datum-DOM is found based on the similarity of these two images in RGB color space. As shown in Fig. 3, $(x_{L1}, y_{L1})$ are the pixel coordinates of the top left corner of the best matching area for the video frame on datum-DOM in RGB color space, and are obtained from Eq. (3), where $N_{LF}$ and $N_{WF}$ are the length and width of the video frame in pixels respectively; $N_{LD}$ and $N_{WD}$ are the length and width of datum-DOM in pixels respectively; $N_F$ is the total number of pixels in the video frame, $N_F = N_{LF} N_{WF}$.
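As an illustration of block matching in RGB color space, the following Python sketch slides the video frame over datum-DOM and keeps the offset with the smallest mean absolute RGB difference. The similarity measure is an assumption made for illustration and is not necessarily the one defined in Eq. (3). The sketch also assumes that both images have already been resampled to the same ground resolution, and the function name `match_rgb` is hypothetical.

```python
import numpy as np


def match_rgb(frame, dom, step=1):
    """Return (x, y) of the top-left corner in `dom` that best matches `frame`.

    The cost is the mean absolute RGB difference over the N_F frame pixels
    (an assumed measure). Both arrays have shape (rows, cols, 3).
    """
    fh, fw = frame.shape[:2]
    dh, dw = dom.shape[:2]
    frame = frame.astype(np.float64)
    best_cost, best_xy = np.inf, (0, 0)
    for y in range(0, dh - fh + 1, step):
        for x in range(0, dw - fw + 1, step):
            window = dom[y:y + fh, x:x + fw].astype(np.float64)
            cost = np.abs(window - frame).mean()
            if cost < best_cost:
                best_cost, best_xy = cost, (x, y)
    return best_xy
```

The exhaustive search is written for clarity; a coarser `step` or a down-sampled image pair would be needed to approach real-time speed.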

Block-matching roughly based on gradient magnitude
At this step, the best matching area for the video frame on datum-DOM is found based on the similarity of these two images in gradient magnitude space. As shown in Fig. 3, $(x_{L2}, y_{L2})$ are the pixel coordinates of the top left corner of the best matching area for the video frame on datum-DOM in gradient magnitude space, and are obtained from Eq. (4), where $F_x(x, y)$ and $F_y(x, y)$ are the first partial derivatives of the video frame in the x and y directions respectively; $D_x(\Delta x_2 + x, \Delta y_2 + y)$ and $D_y(\Delta x_2 + x, \Delta y_2 + y)$ are the first partial derivatives of datum-DOM in the x and y directions respectively; $(x, y)$ are pixel coordinates in the video frame; $N_{LF}$ and $N_{WF}$ are the length and width of the video frame in pixels respectively; $N_{LD}$ and $N_{WD}$ are the length and width of datum-DOM in pixels respectively.
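A comparable sketch for gradient magnitude space is given below. It assumes a normalized correlation between the gradient magnitude images as the similarity score, which may differ from the exact form of Eq. (4). The function names are hypothetical, and both inputs are assumed to be single-channel images at the same ground resolution.

```python
import numpy as np


def gradient_magnitude(gray):
    """Gradient magnitude from first partial derivatives in y and x."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy)


def match_gradient(frame_gray, dom_gray, step=1):
    """Return (x, y) of the top-left corner with the highest normalized
    correlation of gradient magnitudes (an assumed similarity score)."""
    f = gradient_magnitude(frame_gray)
    d = gradient_magnitude(dom_gray)
    fh, fw = f.shape
    dh, dw = d.shape
    f_energy = (f * f).sum()
    best_score, best_xy = -np.inf, (0, 0)
    for y in range(0, dh - fh + 1, step):
        for x in range(0, dw - fw + 1, step):
            w = d[y:y + fh, x:x + fw]
            denom = np.sqrt(f_energy * (w * w).sum())
            score = (f * w).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy
```

Because the score is normalized, a uniform brightness change between the two images leaves the result unchanged, which is the motivation for matching on gradients under different illumination.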

Block-matching robustly
In practice, it has been found that the RGB-based block-matching method proposed above performs better on video frames with large color differences and complicated textures, while the gradient-magnitude-based block-matching method proposed above performs better on video frames with small color differences and simple textures. Therefore, it is necessary to construct a robust block-matching method that considers both the color difference and the texture complexity of the video frame.
In the robust block-matching method, the symbol $TH$ is proposed to represent comprehensively the color difference amplitude and texture complexity of the video frame, and a threshold of 20 is selected to judge $TH$. If $TH \le 20$, the video frame is considered to have large color difference and complicated texture, and the matching result in section "Block-matching roughly based on RGB color" should have a larger weight. On the contrary, if $TH > 20$, the video frame is considered to have small color difference and simple texture, and the matching result in section "Block-matching roughly based on gradient magnitude" should have a larger weight. $TH$ is calculated in Eq. (5), and the threshold of 20 was selected through numerous practical experiments.
As shown in Fig. 3, $(x_L, y_L)$ are the coordinates of the top left corner of the best matching area obtained by the proposed robust block-matching method, and are calculated by Eq. (5), where $\omega_L$ is a weight; the expression $(N_{LF}/10)^2 + (N_{WF}/10)^2$ is used in the accompanying definitions; the equations of $r$, $\omega_L$ and $TH$ were all constructed through numerous practical experiments; $(x_{L1}, y_{L1})$ and $(x_{L2}, y_{L2})$ are obtained by Eqs. (3) and (4) respectively; $N_{LF}$ and $N_{WF}$ are the length and width of the video frame in pixels respectively; $N_F$ is the total number of pixels in the video frame; $TH$ represents the color difference and texture complexity of the video frame; the remaining parameters are as defined for Eqs. (3) and (4).

Extracting block-matched-DOM
According to the parameters $(x_L, y_L, N_{LF}, N_{WF})$ calculated in Eq. (5), block-matched-DOM can be extracted from datum-DOM. As shown in Fig. 3, block-matched-DOM is the area in the blue box marked by number 5, and is the best matching area ultimately found for the video frame on datum-DOM.
It should be noted that the video frame and its corresponding block-matched-DOM have the same size in pixels, so the geodetic coordinates of each pixel in the video frame can be obtained directly from the geodetic coordinates of the pixel at the same position in block-matched-DOM. That is to say, positioning of the UAV's patrolling video frame can be realized by directly assigning the geodetic coordinates of each pixel in block-matched-DOM to the pixel at the same position in the video frame.
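The fusion and extraction steps can be sketched as follows. The expressions that derive the weight from TH and r in Eq. (5) are not reproduced here, so the weight is passed in as a plain argument `w_rgb`, and the linear blend itself is an assumption made for illustration. The function names `fuse_corners` and `crop_block` are hypothetical.

```python
import numpy as np


def fuse_corners(corner_rgb, corner_grad, w_rgb):
    """Blend two top-left corner estimates (x, y) with weight w_rgb in [0, 1].

    w_rgb is the weight of the RGB-based result; the gradient-based result
    receives 1 - w_rgb. The linear blend is an assumed form.
    """
    if not 0.0 <= w_rgb <= 1.0:
        raise ValueError("w_rgb must lie in [0, 1]")
    x = w_rgb * corner_rgb[0] + (1.0 - w_rgb) * corner_grad[0]
    y = w_rgb * corner_rgb[1] + (1.0 - w_rgb) * corner_grad[1]
    return int(round(x)), int(round(y))


def crop_block(datum_dom, corner, n_lf, n_wf):
    """Cut the block of n_lf columns by n_wf rows whose top-left corner is `corner`."""
    x, y = corner
    return datum_dom[y:y + n_wf, x:x + n_lf]


corner = fuse_corners((10, 20), (14, 28), w_rgb=0.75)
block = crop_block(np.zeros((100, 120, 3)), corner, n_lf=40, n_wf=30)
```

Here the length N_LF is taken to run along image columns and the width N_WF along image rows, which is also an assumption.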

Construction of precise polynomial-rectifying algorithm
Unfortunately, there is a high probability that pixels in the video frame are not homologous with the pixels at the same position in block-matched-DOM, owing to numerous negative factors such as illumination variation, inconsistent spatial resolution, diverse surface specific features, topographic relief, camera distortion and different projection modes. That is to say, the positioning accuracy of the video frame is still poor if the geodetic coordinates of each pixel in block-matched-DOM are assigned directly to the pixel at the same position in the video frame. To realize accurate positioning of the UAV's patrolling video sequence images, a precise polynomial-rectifying algorithm is further constructed.
The basic idea of the proposed precise polynomial-rectifying algorithm is to find the homologous regions in block-matched-DOM for regions in the video frame, so as to work out accurate rectification parameters for mapping the video frame to block-matched-DOM. Accurate positioning of the video frame is then realized by using these rectification parameters to calculate the geodetic coordinates of each pixel in the video frame. Here, homologous regions refer to the most similar local areas between two images. It should be noted that the algorithm searches for homologous regions between the video frame and block-matched-DOM rather than homologous points, because homologous regions are more stable and reliable than homologous points under numerous negative influences.
Through in-depth study of the common characteristics of block-matched-DOM and the video frame, the precise polynomial-rectifying algorithm is constructed on three assumptions: (1) the video frame and block-matched-DOM can be regarded as two adjacent sequence images; (2) overall surface features are similar between the video frame and block-matched-DOM; (3) pixels in a local area of the video frame share the same deformation law.

Constructing polynomials of video frame and that of block-matched-DOM
To reduce the negative influence of illumination variation, the gradient magnitude images of the video frame and of block-matched-DOM are used for image matching. To further reduce the negative influence of diverse surface specific features, these gradient magnitude images are each represented by second-order polynomials, and the second-order polynomials of the two images are ultimately used for image matching.
As shown in Fig. 4, the gradient magnitude images of the video frame and of block-matched-DOM are each evenly divided into n × n local areas, and each local area is represented by a second-order polynomial as:

$$f^{F}_{ij}(X_{I}, T_{I}) = X_{I}^{T} A_{I} X_{I} + B_{I}^{T} X_{I} + C_{I}, \qquad f^{D}_{ij}(X_{II}, T_{II}) = X_{II}^{T} A_{II} X_{II} + B_{II}^{T} X_{II} + C_{II} \tag{6}$$

where $f^{F}_{ij}(X_{I}, T_{I})$ and $f^{D}_{ij}(X_{II}, T_{II})$ are the intensities of the local area in row i and column j of Fig. 4a,b respectively.
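One way to obtain such a polynomial for a local area is an ordinary least-squares fit, sketched below in Python. The monomial order (1, x, y, x², xy, y²) for the six parameters m1 to m6 and the use of coordinates centred on the local area are assumptions; the source does not state how the parameters are estimated. The function names are hypothetical.

```python
import numpy as np


def quadratic_basis(h, w):
    """Design matrix with columns 1, x, y, x^2, x*y, y^2 on a centred pixel grid."""
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    x = (x - (w - 1) / 2).ravel()
    y = (y - (h - 1) / 2).ravel()
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])


def fit_quadratic(patch):
    """Least-squares estimate of (m1, ..., m6) for one local area."""
    h, w = patch.shape
    values = patch.astype(np.float64).ravel()
    coeffs, *_ = np.linalg.lstsq(quadratic_basis(h, w), values, rcond=None)
    return coeffs
```

With this ordering, m1 plays the role of the scalar C, (m2, m3) the first-order vector B, and m4, m5, m6 fill the second-order coefficient matrix.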

Constructing differential-difference polynomials
Block-matched-DOM is the best matching area for the video frame on datum-DOM. However, there are still irregular motion displacements between the homologous regions of these two images due to numerous negative factors. Therefore, finding the homologous regions of these two images is important for precise positioning of the video frame.
Based on the assumption that the video frame and block-matched-DOM can be regarded as two adjacent sequence images, their second-order polynomials can also be regarded as two adjacent sequence images. Differential-difference polynomials can then be constructed based on Eq. (6) and rewritten, using a first-order Taylor expansion, as Eq. (7), where $f_F(X_I, T_I)$ and $f_D(X_{II}, T_{II})$ are the intensities of the corresponding local areas in Fig. 4a,b respectively; $X_I$ and $X_{II}$ are pixel coordinates in the local areas of Fig. 4a,b respectively; $T_I$ and $T_{II}$ are the production times of the video frame and of block-matched-DOM respectively. As shown in Fig. 5, $\Delta X$ is a small motion displacement from a local area of the video frame to the corresponding local area of block-matched-DOM. That is to say, the homologous regions between the video frame and block-matched-DOM can be obtained by finding the $\Delta X$ that minimizes $d$ in Eq. (7).
Letting $d$ in Eq. (7) be exactly equal to zero yields Eq. (8), from which the equations of $\Delta X$, Eq. (9), are obtained.

Constructing precise rectifying equations
$\Delta X$ in Eq. (9) can also be regarded as the registration errors between the video frame and block-matched-DOM. These registration errors are supposed to be caused by the video frame's scaling, displacement, rotation, distortion and similar effects. $\Delta X = (\Delta x, \Delta y)^{T}$ can then also be represented by second-order polynomials as:

$$\Delta x = a_0 + a_1 x_f + a_2 y_f + a_3 x_f^2 + a_4 x_f y_f + a_5 y_f^2, \qquad \Delta y = b_0 + b_1 x_f + b_2 y_f + b_3 x_f^2 + b_4 x_f y_f + b_5 y_f^2 \tag{10}$$

where $x_f$, $y_f$ are the coordinates of a local area in the video frame. According to Eqs. (9) and (10), the precise rectifying equations, Eq. (11), can ultimately be constructed.
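How a displacement can be recovered from the coefficients of two quadratic polynomials may be illustrated with the classic polynomial-expansion relation used in dense motion estimation. The sketch below is offered as an analogy under stated assumptions, not as a reproduction of Eqs. (8) and (9): it assumes that the second polynomial is an exact shifted copy of the first and that the coefficient matrix is symmetric. The function names and the sign convention of the displacement are hypothetical.

```python
import numpy as np


def shift_quadratic(A, b, c, d):
    """Coefficients of g(x) = f(x - d) for f(x) = x^T A x + b^T x + c (A symmetric)."""
    b_shifted = b - 2.0 * A @ d
    c_shifted = d @ A @ d - b @ d + c
    return A, b_shifted, c_shifted


def displacement(A, b_first, b_second):
    """Solve A d = (b_first - b_second) / 2 for the displacement d."""
    return np.linalg.solve(A, (b_first - b_second) / 2.0)


A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -3.0])
d_true = np.array([0.3, -0.7])
_, b_shifted, _ = shift_quadratic(A, b, 4.0, d_true)
d_est = displacement(A, b, b_shifted)
```

The right-hand side (b_first − b_second)/2 has the same form as the first two components of the vector L defined for Eq. (11).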

Constructing optimal estimation model
As shown in Eq. (11), the task of finding $\Delta X$ is converted into finding $t$, and each pair of local areas in Fig. 4a,b yields 3 equations. That is to say, $3n^2$ equations can be constructed in the form of Eq. (11), as there are $n^2$ pairs of local areas in Fig. 4a,b.
According to the presumption that the energy difference between the video frame and block-matched-DOM should be minimal in homologous regions, the optimization criterion for the $3n^2$ equations constructed in the form of Eq. (11) is proposed as Eq. (12), where $V = At - L$ is the vector of residual errors, which is weighted by a weight matrix; $t$ is the vector of unknown parameters; the remaining parameters are as defined for Eq. (11).
To obtain the optimal estimate of $t$, the following iteration process is recommended.
① Down-sample the images and construct k-layer image pyramids for the video frame and block-matched-DOM.
② Set the weight matrix to $I$, the identity matrix; set $i = k$ and $t = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)^T$.
③ Construct the matrix $A$ and the vector $L$ from the ith-layer images of the pyramids.
④ Calculate the correction vector $\Delta t$ for $t$ by solving the weighted least-squares normal equations formed from $A$, $L$ and the weight matrix.
⑤ Calculate the vector of residual errors.
⑥ Redefine the weight matrix.
⑧ Repeat steps ④-⑦ m times; m = 3 is set in this study.
⑨ Set $i = i - 1$ and repeat steps ③-⑧ until $i$ equals zero; the optimal estimate of $t$ is the result of the last iteration.
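The inner loop of this process resembles iteratively reweighted least squares, which the Python sketch below illustrates for a single pyramid layer. The reweighting rule (inverse absolute residual) and the update t ← t + Δt are assumptions, since the corresponding formulas are not reproduced here, and the coarse-to-fine loop over pyramid layers is omitted. The function name `irls` is hypothetical.

```python
import numpy as np


def irls(A, L, m=3, eps=1e-3):
    """Estimate t in A t ~ L by iteratively reweighted least squares."""
    w = np.ones(A.shape[0])   # diagonal of the weight matrix, initially I
    t = np.zeros(A.shape[1])  # initial parameters, all zero
    for _ in range(m):
        Aw = A * w[:, None]   # rows of A scaled by their weights
        dt = np.linalg.solve(A.T @ Aw, Aw.T @ (L - A @ t))  # correction to t
        t = t + dt
        v = A @ t - L         # residual vector V = A t - L
        w = 1.0 / np.maximum(np.abs(v), eps)  # assumed rule: down-weight large residuals
    return t
```

Down-weighting equations with large residuals limits the influence of local areas that were matched badly, for example areas dominated by shadows or tall buildings.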

Positioning of UAV's patrolling video frame
By using the optimal estimate of $t$ resolved above, the precise geodetic coordinates of each pixel in the video frame can be obtained as:

$$(L, B)^{T} = P X' \tag{13}$$

where $(L, B)$ are the geodetic coordinates of a pixel in the video frame; $P = \begin{pmatrix} P_A & P_B & P_C \\ P_D & P_E & P_F \end{pmatrix}$ is a transformation matrix provided by the producer of region-DOM; $X' = X + X_f t$; $X = (x, y, 1)^T$, where $x$, $y$ are the pixel coordinates of a pixel in the video frame;

$$X_f = \begin{pmatrix} 1 & x_f & y_f & x_f^2 & x_f y_f & y_f^2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & x_f & y_f & x_f^2 & x_f y_f & y_f^2 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

$x_f$ and $y_f$ are the column and row numbers of the local area where the pixel is located; $t$ is the vector of optimal rectifying parameters calculated above. Finally, according to Eq. (13), precise positioning of the UAV's patrolling video sequence images can be realized by calculating the geodetic coordinates of each pixel in the images.
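The positioning step can be sketched in Python as below. The sketch assumes that Eq. (13) reads $(L, B)^T = P(X + X_f t)$ and that the pixel coordinates x, y are already expressed in the pixel grid to which P applies. The numeric values of P are those quoted for the region-DOM in the first two experiments; the function name `pixel_to_geodetic` is hypothetical.

```python
import numpy as np

# Transformation matrix P of the region-DOM quoted in the first two experiments.
P = np.array([[0.0000053644, 0.0, 111.2558010221],
              [0.0, -0.0000053644, 34.2457553744]])


def pixel_to_geodetic(x, y, x_f, y_f, t, P=P):
    """Return (L, B) for pixel (x, y) located in local area (x_f, y_f).

    t holds the 12 rectifying parameters (a0..a5, b0..b5).
    """
    basis = np.array([1.0, x_f, y_f, x_f ** 2, x_f * y_f, y_f ** 2])
    X_f = np.zeros((3, 12))
    X_f[0, :6] = basis   # first row yields the correction in x
    X_f[1, 6:] = basis   # second row yields the correction in y
    X = np.array([x, y, 1.0])
    lon, lat = P @ (X + X_f @ np.asarray(t, dtype=np.float64))
    return lon, lat
```

With t set to zero the function reduces to the direct assignment used for rough positioning, so the rectifying parameters act purely as a correction on top of the block-matching result.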

Case study
Three practical experiments are designed in this study, involving 3 videos and 2 region-DOMs. The 3 videos were shot during 3 UAV sorties over different areas: a town area, a river area and an area of high relief amplitude. The 2 region-DOMs have different spatial resolutions, one lower and one higher.

The first experiment
As shown in Fig. 6, Fig. 6a is the region-DOM used in this experiment, which was made on January 31, 2020, with a length of 4096 pixels, a width of 1792 pixels and a spatial resolution of 0.493663 m/pixel. The video used in this experiment was shot during 1 UAV sortie at an altitude of about 250 m on November 24, 2021; it includes 3154 frames at 23.98 fps (frames per second), with a spatial resolution of 0.0684932 m/pixel and a single-frame size of 4096 pixels in length and 2160 pixels in width. The video was shot in a town area. Figure 6b is the 301st frame of the video, picked out for algorithm demonstration without loss of generality. The POS data of the 301st frame were obtained by the IMU (Inertial Measurement Unit) mounted on the UAV: the center geodetic coordinates are (111.2661504°, 34.2428275°), the flight altitude is 250.1 m, the pitch angle is −8.3°, the roll angle is −1.3° and the yaw angle is 82.5°.
According to the theory proposed in section "Extraction of datum-DOM", Fig. 6c is datum-DOM that is extracted from Fig. 6a on the basis of POS data of Fig. 6b.
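As a rough worked example for this experiment, the size of datum-DOM can be estimated from the figures quoted above. The sketch assumes that the ground length covered by a frame equals the frame length in pixels times the frame's spatial resolution, and that the datum-DOM size follows $n \times L_{dist} / gsd_{D}$ with n between 1.5 and 2. The values of n actually used in the experiment are not stated, so both bounds are shown.

```python
# Values quoted for the first experiment.
frame_length_px, frame_width_px = 4096, 2160
gsd_frame = 0.0684932  # m/pixel, video frame
gsd_dom = 0.493663     # m/pixel, region-DOM

# Assumed ground footprint of one frame, in metres.
l_dist = frame_length_px * gsd_frame
w_dist = frame_width_px * gsd_frame


def datum_pixels(dist_m, n):
    """Datum-DOM extent in DOM pixels for scaling coefficient n."""
    return n * dist_m / gsd_dom


for n in (1.5, 2.0):
    print(n, round(datum_pixels(l_dist, n)), round(datum_pixels(w_dist, n)))
```

Under these assumptions the frame covers roughly 281 m by 148 m on the ground, and one DOM pixel spans about seven frame pixels, which illustrates how much coarser the region-DOM is than the video frame in this experiment.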
According to the theory proposed in section "Positioning of UAV's patrolling video frame", Fig. 6g is the accurate positioning result of the video frame. Figure 6g is obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in Fig. 6b, where $t$ is obtained by the optimal estimation model mentioned above, $P$ is provided by the producer of region-DOM, and $P = \begin{pmatrix} 0.0000053644 & 0 & 111.2558010221 \\ 0 & -0.0000053644 & 34.2457553744 \end{pmatrix}$.
Figure 6h is a hybrid image formed by superimposing Fig. 6b on Fig. 6a according to their geodetic coordinates, where the geodetic coordinates of Fig. 6a are pre-acquired and those of Fig. 6b are directly assigned from the block-matched-DOM. In Fig. 6h, the gray area is Fig. 6b and the 20 red points are interest points on Fig. 6b. Distance deviations between the 20 red homologous points in Fig. 6a,b are measured in ArcGIS and listed in Table 1; the average distance deviation is 4.614 m.
Figure 6i is a hybrid image formed by superimposing Fig. 6g on Fig. 6a according to their geodetic coordinates, where the geodetic coordinates of Fig. 6a are pre-acquired and those of Fig. 6g are obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in the video frame. In Fig. 6i, the gray area is Fig. 6g and the 20 red points are interest points on Fig. 6b. To improve the reliability and generality of the experiment, all 20 red homologous points are evenly selected from distinctive terrain points and building points without any deliberate adjustment. Distance deviations between the 20 red homologous points in Fig. 6a,g are measured in ArcGIS and listed in Table 1; the average distance deviation is 2.172 m.
By timekeeping in our program, it takes about 0.206 s to complete extracting of the block-matched-DOM, it takes about 0.330 s to complete calculating of the optimal estimation t , and it takes about 0.101 s to complete calculating of the precise geodetic coordinates of video frame pixel by pixel.That is to say, the total positioning time of this UAV's patrolling video frame is less than 1 s.

The second experiment
As shown in Fig. 7, Fig. 7a is the region-DOM used in this experiment, and is the same as Fig. 6a. The video used in this experiment was shot during 1 UAV sortie at an altitude of about 250 m on November 25, 2021; it includes 4687 frames at 23.98 fps (frames per second), with a spatial resolution of 0.0684932 m/pixel and a single-frame size of 4096 pixels in length and 2160 pixels in width. The video was shot in a river area. Figure 7b is the 3547th frame of the video, picked out for algorithm demonstration without loss of generality. The POS data of the 3547th frame were obtained by the IMU mounted on the UAV: the center geodetic coordinates are (111.2658703°, 34.2406338°), the flight altitude is 250.1 m, the pitch angle is −7.1°, the roll angle is 2.9° and the yaw angle is −94.8°.
According to the theory proposed in section "Extraction of datum-DOM", Fig. 7c is datum-DOM that is extracted from Fig. 7a on the basis of POS data of Fig. 7b.
According to the theory proposed in section "Construction of robust block-matching algorithm", Fig. 7d is the block-matched-DOM extracted from Fig. 7c, and Fig. 7d is the best matching area for Fig. 7b on Fig. 7c.
According to the theory proposed in section "Positioning of UAV's patrolling video frame", Fig. 7g is the accurate positioning result of the video frame. Figure 7g is obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in Fig. 7b, where $t$ is obtained by the optimal estimation model mentioned above, $P$ is provided by the producer of region-DOM, and $P = \begin{pmatrix} 0.0000053644 & 0 & 111.2558010221 \\ 0 & -0.0000053644 & 34.2457553744 \end{pmatrix}$.
Figure 7h is a hybrid image formed by superimposing Fig. 7b on Fig. 7a in software according to their geodetic coordinates, where the geodetic coordinates of Fig. 7a are pre-acquired and those of Fig. 7b are directly assigned from the block-matched-DOM. In Fig. 7h, the gray area is Fig. 7b and the 20 red points are interest points on Fig. 7b. Distance deviations between the 20 red homologous points in Fig. 7a,b are measured in ArcGIS and listed in Table 2; the average distance deviation is 5.240 m.
Table 1. Distance deviations between 20 red homologous points in Fig. 6h,i.
Figure 7i is a hybrid image formed by superimposing Fig. 7g on Fig. 7a according to their geodetic coordinates, where the geodetic coordinates of Fig. 7a are pre-acquired and those of Fig. 7g are obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in the video frame. In Fig. 7i, the gray area is Fig. 7g and the 20 red points are interest points on Fig. 7b. To improve the reliability and generality of the experiment, all 20 red homologous points are evenly selected from distinctive terrain points and building points without any deliberate adjustment. Distance deviations between the 20 red homologous points in Fig. 7a,g are measured in ArcGIS and listed in Table 2; the average distance deviation is 2.253 m.
By timekeeping in our program, it takes about 0.119 s to complete extracting of the block-matched-DOM, it takes about 0.118 s to complete calculating of the optimal estimation t , and it takes about 0.053 s to complete calculating of the precise geodetic coordinates of video frame pixel by pixel.That is to say, the total positioning time of this UAV's patrolling video frame is less than 1 s.

The third experiment
As shown in Fig. 8, Fig. 8a is the region-DOM used in this experiment, which was made on May 13, 2021, with a length of 19,266 pixels, a width of 14,483 pixels and a spatial resolution of 0.08 m/pixel. The video used in this experiment was shot during 1 UAV sortie at an altitude of about 250 m on November 26, 2021, and includes 3788 frames.
Table 2. Distance deviations between 20 red homologous points in Fig. 7h,i.
According to the theory proposed in section "Extraction of datum-DOM", Fig. 8c is the datum-DOM extracted from Fig. 8a on the basis of the POS data of Fig. 8b.
According to the theory proposed in section "Construction of robust block-matching algorithm", Fig. 8d is the block-matched-DOM extracted from Fig. 8c, and Fig. 8d is the best matching area for Fig. 8b on Fig. 8c.
According to the theory proposed in section "Positioning of UAV's patrolling video frame", Fig. 8g is the accurate positioning result of the video frame. Figure 8g is obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in Fig. 8b.
Figure 8i is a hybrid image formed by superimposing Fig. 8g on Fig. 8a according to their geodetic coordinates, where the geodetic coordinates of Fig. 8a are pre-acquired and those of Fig. 8g are obtained by using the parameters $t$ and $P$ to calculate the geodetic coordinates of each pixel in the video frame. In Fig. 8i, the gray area is Fig. 8g and the 20 red points are interest points on Fig. 8b. To improve the reliability and generality of the experiment, all 20 red homologous points are evenly selected from distinctive terrain points and building points without any deliberate adjustment. Distance deviations between the 20 red homologous points in Fig. 8a,g are measured in ArcGIS and listed in Table 3; the average distance deviation is 3.619 m.
By timekeeping in our program, it takes about 0.118 s to extract the block-matched-DOM, about 0.122 s to calculate the optimal estimate of $t$, and about 0.074 s to calculate the precise geodetic coordinates of the video frame pixel by pixel. That is to say, the total positioning time of this UAV's patrolling video frame is less than 1 s.

Experimental analysis
In the first experiment, the spatial resolution of region-DOM is far lower than that of the video frame, the surface universal features of region-DOM are similar to those of the video frame, and the surface specific features and illumination conditions of region-DOM are greatly different from those of the video frame. The experimental results show that the average positioning deviation of all interest points in Fig. 6h is about 4.614 m, and that of all interest points in Fig. 6i is about 2.172 m. Among them, interest points located on roads and low-rise buildings have lower positioning deviations, while interest points located on high-rise buildings have higher positioning deviations.
In the second experiment, the spatial resolution of region-DOM is still far lower than that of the video frame, the surface universal features of region-DOM are similar to those of the video frame, the surface specific features and illumination conditions of region-DOM are greatly different from those of the video frame, and there are significantly fewer surface features on the left side of the video frame than on the right side. The experimental results show that the average positioning deviation of all interest points in Fig. 7h is about 5.2402 m, and that of all interest points in Fig. 7i is about 2.2532 m. Among them, interest points located on roads and low-rise buildings have lower positioning deviations, interest points located on high-rise buildings have higher positioning deviations, and interest points on the left side of the video frame have higher positioning deviations than those on the right side.
In the third experiment, the spatial resolution of region-DOM is similar to that of the video frame, the surface universal features of region-DOM are similar to those of the video frame, and the surface specific features and illumination conditions of region-DOM are slightly different from those of the video frame, while there are extensive mountain shadows on region-DOM. The experimental results show that the average positioning deviation of all interest points in Fig. 8h is about 7.1051 m, and that of all interest points in Fig. 8i is about 3.6193 m. Among them, interest points located on roads and low-rise buildings have lower positioning deviations than in the first two experiments, while interest points located on mountain edges have the highest positioning deviations.
By analyzing the above three experiments, the following conclusions can be drawn.
(1) The geometrical shape of the video frame is obviously deformed after accurate positioning, as shown in Figs. 6g, 7g and 8g.
(2) The average positioning deviation of the video frame using the proposed robust block-matching algorithm is 5.653 m, and that using the proposed precise polynomial-rectifying algorithm is 2.681 m. That is to say, the positioning accuracy of the video frame can be significantly increased by using the proposed precise polynomial-rectifying algorithm.
(3) The red homologous points located on roads and low-rise buildings have a higher positioning accuracy, while those located on mountains and high-rise buildings have a lower positioning accuracy.
(4) Using a region-DOM of high spatial resolution can significantly improve the positioning accuracy of the video frame, while extensive shadows that are similar to the video frame's surface universal features will significantly decrease it.
(5) The proposed model can be applied in various areas, such as town areas, river areas and areas of high relief amplitude. The experimental results show that the average positioning accuracy in the town area and river area is slightly higher than that in the area of high relief amplitude, as high terrain relief has a negative influence on the distortion of imaging.
(6) By timekeeping in our program, the average time of extracting the block-matched-DOM is about 0.148 s, the average time of calculating the optimal estimate of $t$ is about 0.19 s, and the average time of calculating the precise geodetic coordinates of all pixels in a video frame is about 0.076 s. That is to say, the total positioning time of a UAV's patrolling video frame is less than 1 s.
(7) The proposed methods can also be applied to medical image registration, remote sensing image registration, visual navigation in other industries and similar fields.
Subsequently, the current mathematical model can be optimized significantly by fusing it with multi-source data, such as airborne LiDAR point clouds, so as to achieve a higher positioning accuracy and a broader application.
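The averages quoted in conclusions (2) and (6) can be checked against the per-experiment figures reported above with a few lines of Python. The variable names are illustrative; the third block-matching deviation is the value 7.1051 m from the experimental analysis, rounded to three decimals.

```python
# Average deviations (m) and timings (s) reported for the three experiments.
deviation_block_matching = [4.614, 5.240, 7.105]
deviation_rectified = [2.172, 2.253, 3.619]
time_block_matching = [0.206, 0.119, 0.118]
time_estimation = [0.330, 0.118, 0.122]
time_coordinates = [0.101, 0.053, 0.074]


def mean(values):
    return sum(values) / len(values)


# Total positioning time of the demonstrated frame in each experiment.
totals = [a + b + c for a, b, c in
          zip(time_block_matching, time_estimation, time_coordinates)]

print(round(mean(deviation_block_matching), 3), round(mean(deviation_rectified), 3))
print(round(mean(time_block_matching), 3), round(mean(time_estimation), 3),
      round(mean(time_coordinates), 3))
print([round(total, 3) for total in totals])
```

The per-frame totals all stay well under the 1 s bound stated in conclusion (6).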

Conclusion
In order to realize real-time positioning of UAV's patrolling video sequence images, a visual positioning model is recommended, including a robust block-matching algorithm and a precise polynomial-rectifying algorithm.First, the robust block-matching algorithm is constructed to realize roughly positioning of UAV's video patrolling video sequence images.The robust block-matching algorithm is divided into 5 steps, including scaling datum-DOM, block-matching roughly based on RGB, Block-matching roughly based on gradient magnitude, block-matching robustly, and extracting block-matched-DOM.Through the above 5 steps, the so-called blockmatched-DOM can be obtained, and rough positioning of UAV's patrolling video sequence images can be realized by assigning geodetic coordinates of each pixel in block-matched-DOM to pixels at the same position in UAV's patrolling video sequence images.
Second, the precise polynomial-rectifying algorithm is constructed to realize accurate positioning of UAV's patrolling video sequence images.The precise polynomial-rectifying algorithm is divided into 5 steps, including constructing polynomials of video frame and that of block-matched-DOM, constructing differential-difference polynomials, constructing precise rectifying equations, constructing optimal estimation model, and calculating geodetic coordinates of interest points in video frame.Through the above 5 steps, the so-called accurate rectification parameters can be obtained, and accurate positioning of UAV's patrolling video sequence images can be realized by using accurate rectification parameters to calculate geodetic coordinates of each pixel in UAV's patrolling video sequence images.
Finally, all the proposed algorithms are verified by three practical experiments. The results indicate that the proposed robust block-matching algorithm can realize positioning of UAV's patrolling video sequence images with an average accuracy of 5 m, even if the spatial resolution, surface specific features, illumination and topographic relief of the region-DOM are greatly different from those of UAV's patrolling video sequence images. The proposed precise polynomial-rectifying algorithm can further improve the positioning accuracy of UAV's patrolling video sequence images to an average of about 2.5 m. The calculation time for positioning a single UAV's patrolling video sequence image is less than 1 s.

Figure 1. Key images involved in this study.

Figure 3. The robust block-matching of datum-DOM and video frame.
represent transpose of a matrix (vector); (x I , y I ) and (x � , y � ) are pixel coordinates in local areas of Fig. 4a,b respectively; T I and T are production time of video frame and that of block-matched-DOM respectively; … are second-order coefficient matrices of their polynomials respectively; B I = (m I 2 , m I 3 ) T , B � = (m � 2 , m � 3 ) T , B I and B are first-order coefficient vectors of their polynomials respectively; C I = m I 1 , C = m 1 , C I and C are scalars of their polynomials respectively; m I 1 , m I 2 , m I 3 , m I 4 , m I 5 , m I 6 , m 1 , m 2 , m 3 , m 4 , m 5 , m 6 are parameters of polynomials.

Figure 4. Local areas divided in video frame and block-matched-DOM.

Figure 5. Small motion displacement from a local area of video frame to the corresponding local area of block-matched-DOM.
and y f are column and row numbers of a local area in video frame respectively; t = (a_0, a_1, a_2, a_3, a_4, a_5, b_0, b_1, b_2, b_3, b_4, b_5)^T is the vector of unknown parameters to be resolved; L = (B I − B � )/2 C I − C � .
… points in Fig. 7i (m): 0.539, 0.354, 2.157, 0.720, 0.411, 0.307, 0.219, 5.540, 5.811, 6.106
Mean deviation of homologous points in Fig. 7h (m): 5.240
Mean deviation of homologous points in Fig. 7i (m): 2.253

… with fps (frames per second) of 23.98, a spatial resolution of 0.0684932 m/pixel, and a single-frame length of 4096 pixels and width of 2160 pixels. In addition, the video was shot in a high relief amplitude area. Figure 8b is the 901st frame of the video used in this experiment, and is picked out for algorithm demonstration without loss of generality. POS data of the 901st frame are obtained by the IMU mounted on the UAV: the center geodetic coordinates are (111.2504477°, 34.2280547°), the flight altitude is 250.5 m, the pitch angle is −22.7°, the roll angle is −9.7°, and the yaw angle is 123.2°.
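As a rough sense of scale, the frame size and spatial resolution quoted above imply the following ground footprint. This rests on the simplifying assumption (ours, not the source's) that the stated resolution applies uniformly across the frame, which an oblique view with a pitch angle of −22.7° only approximates.

```latex
\[
4096 \times 0.0684932\ \text{m} \approx 280.5\ \text{m},
\qquad
2160 \times 0.0684932\ \text{m} \approx 147.9\ \text{m}
\]
```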

Figure 8h is a hybrid image formed by superimposing Fig. 8b on Fig. 8a according to their geodetic coordinates, where the geodetic coordinates of Fig. 8a are pre-acquired and those of Fig. 8b are directly assigned from the block-matched-DOM. In Fig. 8h, the gray area is Fig. 8b and the 20 red points are interest points on Fig. 8b. Distance deviations between the 20 red homologous points in Fig. 8a,b are measured in ArcGIS and listed in Table 3, and the average distance deviation is 7.105 m. Figure 8i is a hybrid image formed by superimposing Fig. 8g on Fig. 8a according to their geodetic coordinates, where the geodetic coordinates of Fig. 8a are pre-acquired and those of Fig. 8g are obtained by using parameters t and P to calculate the geodetic coordinates of each pixel in the video frame. In Fig. 8i, the gray area is Fig. 8g and the 20 red points are interest points on Fig. 8b. In order to improve the reliability and generality of the experiment, all 20 red homologous points are evenly selected from distinctive terrain points and building points without any deliberate adjustment. Distance deviations between the 20 red homologous points in Fig. 8a,g are measured in ArcGIS and listed in Table 3, and the average distance deviation is 3.619 m.
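The deviations in this experiment were measured in ArcGIS. For readers who want to reproduce the same kind of statistic programmatically, here is a minimal sketch. It assumes the homologous points are already expressed in a projected coordinate system in metres, so that planar Euclidean distance is adequate; the function names and sample coordinates are hypothetical and are not values from the paper.

```python
import math


def deviations(ref_pts, test_pts):
    """Planar distance between each pair of homologous points (metres)."""
    return [math.hypot(xr - xt, yr - yt)
            for (xr, yr), (xt, yt) in zip(ref_pts, test_pts)]


def mean_deviation(ref_pts, test_pts):
    """Average distance deviation over all homologous point pairs."""
    d = deviations(ref_pts, test_pts)
    return sum(d) / len(d)


# Hypothetical coordinates for three point pairs (not data from the paper).
ref = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]      # points on the DOM
test = [(3.0, 4.0), (10.0, 0.0), (0.0, 16.0)]     # same points on the frame
devs = deviations(ref, test)
mean_dev = mean_deviation(ref, test)
```

For points given as longitude and latitude, the coordinates would first need to be projected, or a geodesic distance used instead.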

Table 3. Distance deviations between 20 red homologous points in Fig. 8h,i.